[MoE] Align Swiglu MXFP4 fused quant paths by XiaobingSuper · Pull Request #3123 · ROCm/aiter

XiaobingSuper · 2026-05-11T07:52:24Z

Summary

Keep FlyDSL Swiglu MXFP4 fused quantization on the f32 activation path by removing the bf16 round-trip before FP4 quantization.
Preserve the requested Swiglu limit branch structure while keeping GPT-OSS Swiglu MXFP4 on the direct quantization path.
Align test_moe_2stage.py references with runtime Swiglu MXFP4 fused quant semantics by using an f32 stage1 reference for FP4 fused-quant cases.
Infer CSV gateMode from dtype/layout because tuned rows do not carry an explicit gateMode field.
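To illustrate why the bf16 round-trip matters for the points above, here is a minimal NumPy emulation, not the FlyDSL kernel. It compares quantizing the f32 activation directly against quantizing it after a bf16 round-trip. The helper names are hypothetical. The clamped SwiGLU form with `alpha=1.702` and `limit=7.0` and the power-of-two block-scale rule are assumptions for illustration, not taken from this PR.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def bf16_round_trip(x):
    """Round float32 to bf16 (round-to-nearest-even), then widen back to float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounding = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding) & np.uint32(0xFFFF0000)).view(np.float32)

def swiglu(gate, up, alpha=1.702, limit=7.0):
    """Clamped SwiGLU; alpha, limit and the (up + 1) term are assumptions."""
    gate = np.minimum(gate, limit)
    up = np.clip(up, -limit, limit)
    return gate / (1.0 + np.exp(-alpha * gate)) * (up + 1.0)

def mxfp4_quant_dequant(x, block=32):
    """Quantize to MXFP4-like values per 1x32 block and dequantize again.

    Simplification: ties between grid points go to the smaller magnitude.
    """
    x = np.asarray(x, dtype=np.float32).reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared power-of-two scale; E2M1's largest exponent is 2 (6 = 1.5 * 2**2).
    exp = np.floor(np.log2(np.where(amax > 0, amax, 1.0))) - 2
    scale = np.exp2(exp).astype(np.float32)
    scaled = np.clip(np.abs(x) / scale, 0.0, 6.0)
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(x) * E2M1_GRID[idx] * scale).reshape(-1)

rng = np.random.default_rng(0)
gate = rng.standard_normal(4096).astype(np.float32)
up = rng.standard_normal(4096).astype(np.float32)

act = swiglu(gate, up)                                   # stays in f32
direct = mxfp4_quant_dequant(act)                        # f32 -> FP4
via_bf16 = mxfp4_quant_dequant(bf16_round_trip(act))     # f32 -> bf16 -> FP4

mismatch = int(np.count_nonzero(direct != via_bf16))
print(f"elements that quantize differently: {mismatch} of {act.size}")
```

Any nonzero mismatch count comes from activations that sit near an FP4 rounding boundary and are nudged across it by the intermediate bf16 rounding. A test reference therefore has to use the same path as the kernel it checks.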

Test plan

podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && python3 -m py_compile op_tests/test_moe_2stage.py aiter/ops/flydsl/kernels/mixed_moe_gemm_2stage.py aiter/ops/flydsl/kernels/silu_and_mul_fq.py && git diff --check'
podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && HIP_VISIBLE_DEVICES=1 FLYDSL_RUNTIME_CACHE_DIR=/tmp/flydsl_pr3123_test_fp4 AITER_CONFIG_FMOE=/workdir/aiter_main/aiter/configs/model_configs/gptoss_fp4_tuned_fmoe.csv python3 -m op_tests.test_moe_2stage --no-legacy'
podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && HIP_VISIBLE_DEVICES=1 FLYDSL_RUNTIME_CACHE_DIR=/tmp/flydsl_pr3123_test_fp8fp4 AITER_CONFIG_FMOE=/workdir/aiter_main/aiter/configs/model_configs/gptoss_fp8fp4_tuned_fmoe.csv python3 -m op_tests.test_moe_2stage --no-legacy'
podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && HIP_VISIBLE_DEVICES=1 FLYDSL_RUNTIME_CACHE_DIR=/tmp/flydsl_pr3123_test_legacy python3 -m op_tests.test_moe_2stage --no-flydsl-csv -t 1024 -dim 3072,3072 -e 128 -k 4 -q 4 -a swiglu -s f -p t -hip 0,0'

Test result

gptoss_fp4_tuned_fmoe.csv --no-legacy: passed 8 strict CSV cases, command exit code 0.
gptoss_fp8fp4_tuned_fmoe.csv --no-legacy: passed 7 strict CSV cases, command exit code 0.
Legacy Swiglu MXFP4 target case: passed, command exit code 0.

Made with Cursor

Remove the GPT-OSS Swiglu layout env switch in favor of GateMode, align the CSV test filter with runtime dtype selection, and restore FlyDSL Swiglu _fp4 fused quant accuracy by matching the non-fused bf16 stage1 semantics.

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-11T07:53:05Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json` |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3123 --add-label <label>

Copilot

Pull request overview

This PR updates the Swiglu MXFP4 MoE codepaths to remove the legacy GPT-OSS layout environment switch, align runtime q_dtype_a selection with GateMode, and restore FlyDSL fused-quant numerical behavior to match the non-fused bf16 materialization/clamp semantics.

Changes:

Switch Swiglu MXFP4 q_dtype_a selection to be driven by GateMode.SEPARATED vs non-separated modes, and thread gate_mode through the 2-stage config path.
Update CSV-driven MoE 2-stage tests to skip cases whose q_dtype_a no longer matches the runtime Swiglu MXFP4 selection logic (now including gate_mode).
Adjust FlyDSL fused quant kernels to apply the Swiglu alpha/clamp path and bf16 round-trip prior to MXFP4 quantization to match the non-fused semantics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `op_tests/test_moe_2stage.py` | Updates CSV-case filtering to match runtime Swiglu MXFP4 `q_dtype_a` selection, now factoring in `gateMode`. |
| `aiter/ops/flydsl/kernels/silu_and_mul_fq.py` | Aligns fused activation/clamp behavior for Swiglu and adds bf16 round-trip to match non-fused quant semantics. |
| `aiter/ops/flydsl/kernels/mixed_moe_gemm_2stage.py` | Adds bf16 materialization before MXFP4 quantization in the fused stage1 store path for Swiglu FP4. |
| `aiter/fused_moe.py` | Removes the GPT-OSS Swiglu MXFP4 layout env switch and keys runtime dtype selection/config dispatch off `gate_mode`. |

Comments suppressed due to low confidence (1)

aiter/fused_moe.py:827

get_2stage_cfgs() now accepts gate_mode, but the tuned-config lookup keys (_INDEX_COLS / keys) do not incorporate it. If SEPARATED vs INTERLEAVE share the same q_dtype_a/q_dtype_w (e.g. Swiglu MXFP4 small-M where both may be bf16+fp4), this can cause the wrong tuned kernel to be selected or make it impossible to keep separate tuned entries. Consider threading gate_mode through the config index (and logging) so the selected kernel is unambiguous across gate layouts.
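The key collision this comment describes can be sketched with a plain dictionary index. This is a hypothetical illustration only: the column subset, row values, kernel names, and the stand-in `GateMode` enum are assumptions and do not reproduce aiter's actual tuned-config tables.

```python
from enum import Enum

class GateMode(Enum):
    """Stand-in for aiter's GateMode; member values here are assumed."""
    SEPARATED = "separated"
    INTERLEAVE = "interleave"

# Hypothetical subset of index columns, without gate_mode.
INDEX_COLS = ["token", "model_dim", "inter_dim", "q_dtype_a", "q_dtype_w"]

# Two tuned rows that differ only in gate layout (all values are made up).
rows = [
    {"token": 16, "model_dim": 3072, "inter_dim": 3072,
     "q_dtype_a": "bf16", "q_dtype_w": "fp4x2",
     "gate_mode": GateMode.SEPARATED, "kernel": "kernel_separated"},
    {"token": 16, "model_dim": 3072, "inter_dim": 3072,
     "q_dtype_a": "bf16", "q_dtype_w": "fp4x2",
     "gate_mode": GateMode.INTERLEAVE, "kernel": "kernel_interleave"},
]

def build_index(rows, cols):
    """Map a tuple of the chosen columns to a kernel name."""
    index = {}
    for row in rows:
        key = tuple(row[c] for c in cols)
        index[key] = row["kernel"]  # a later row silently overwrites an earlier one
    return index

without_mode = build_index(rows, INDEX_COLS)
with_mode = build_index(rows, INDEX_COLS + ["gate_mode"])

print(len(without_mode), "entry without gate_mode in the key")
print(len(with_mode), "entries with gate_mode in the key")
```

With `gate_mode` left out of the key, both rows collapse into one entry and the surviving kernel depends only on row order. Adding it to the key keeps the two layouts distinct.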

def get_2stage_cfgs(
    token,
    model_dim,
    inter_dim,
    expert,
    topk,
    dtype,
    q_dtype_a,
    q_dtype_w,
    q_type,
    use_g1u1,
    activation,
    doweight_stage1,
    hidden_pad,
    intermediate_pad,
    is_shuffled=True,
    gate_mode=GateMode.SEPARATED.value,
):
    gate_mode = GateMode(gate_mode)
    _INDEX_COLS = [
        "cu_num",
        "token",
        "model_dim",


Zzz9990 · 2026-05-12T01:19:50Z

I think those functions should be merged into this PR: #3129. And let's discuss a more suitable integration solution.

coderfeli

LGTM

The MXFP4 W4A16 weight-load path in oracle/mxfp4.py uses shuffle_weight_a16w4 (is_guinterleave=True), which interleaves gate/up columns within each weight tile. The CK/FlyDSL MoE kernels in aiter must be told this via gate_mode=GateMode.INTERLEAVE so they decode the gate/up packing correctly.

Without the explicit gate_mode, aiter defaults to SEPARATED and (since ROCm/aiter#3123) dispatches the (SEPARATED + Swiglu + per_1x32 + fp4x2) case to a path that returns garbage for shuffled weights or crashes during CK2stages JIT for the unshuffled Quark variant (amd/gpt-oss-20b-w-mxfp4-a-bf16). This was the root cause of ROCM-25517 (gpt-oss-120b W4A16 gsm8k acc = 0) and ROCM-25478 (gpt-oss-20b Quark JIT crash).

Other paths are unaffected:

- FP8 W8A8 (DeepSeek-V4-Pro, DeepSeek-V3.2): shuffled with quark_ocp_mx.py:shuffle_weight(layout=(16,16)), which is non-interleaved. use_mxfp4_w4a16 is False, so the default SEPARATED is preserved.
- MXFP4 W4A4 (amd/DeepSeek-R1-0528-MXFP4): shuffled via rocm_aiter_ops.shuffle_weights, which is non-interleaved. use_mxfp4_w4a16 is False, so the default SEPARATED is preserved.

The gate_mode kwarg was added to aiter.fused_moe in ROCm/aiter#3123 (aiter>=0.1.14). To stay compatible with older aiter shipping with vllm (e.g. aiter 0.1.13.post1 in the vllm-rocm:nightly image), we probe the aiter signature and drop the kwarg when unsupported. aiter before ROCm/aiter#3123 tolerated the implicit SEPARATED default for interleave-shuffled weights, so dropping the kwarg is safe there. GateMode itself only exists on aiter>=0.1.14 and is imported under try/except for the same reason.

Validation on MI355X (gfx950):

vllm@main + aiter@main (6aeba41)
  openai/gpt-oss-120b W4A16 gsm8k:
    TP=1: 0.000 -> 0.905
    TP=8: 0.000 -> 0.905

vllm@main + aiter@main
  amd/gpt-oss-20b-w-mxfp4-a-bf16 TP=2 enforce-eager:
    CK2stages JIT crash -> serves cleanly

vllm-rocm:nightly + aiter 0.1.13.post1
  openai/gpt-oss-120b W4A16 gsm8k:
    TP=1: 0.910 (backward-compat: gate_mode kwarg silently dropped)

vllm-rocm:v0.22.0 + aiter@main
  openai/gpt-oss-120b W4A16 gsm8k:
    TP=1: 0.895

amd/gpt-oss120b-w-mxfp4-a-fp8 W4A8 (this PR composes with vllm-project#44804):
  TP=8 mc=1=326, mc=8=2087, mc=32=6523, mc=64=11610 tok/s

Reference: sgl-project/sglang#25580 (sglang's equivalent fix).
Recommended by aiter maintainer (XiaobingZhang) on ROCm/aiter#3586.

Signed-off-by: Rohan Potdar <rohan.potdar@amd.com>
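The backward-compatibility probe described in this commit message could look roughly like the sketch below. It is an assumption-laden illustration, not the vLLM code: `supports_kwarg`, `call_with_optional_gate_mode`, and the two stand-in `fused_moe` functions are hypothetical, and the real aiter import is deliberately left out.

```python
import inspect

def supports_kwarg(fn, name):
    """Return True if fn accepts a keyword argument called `name`."""
    try:
        params = inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # Some builtins and extension callables expose no signature.
        return False
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

def call_with_optional_gate_mode(fn, *args, gate_mode=None, **kwargs):
    """Forward gate_mode only when the target function supports it."""
    if gate_mode is not None and supports_kwarg(fn, "gate_mode"):
        kwargs["gate_mode"] = gate_mode
    return fn(*args, **kwargs)

# Stand-ins for an older and a newer fused_moe signature (hypothetical).
def old_fused_moe(x):
    return ("old", x, None)

def new_fused_moe(x, gate_mode="separated"):
    return ("new", x, gate_mode)

old_result = call_with_optional_gate_mode(old_fused_moe, 1, gate_mode="interleave")
new_result = call_with_optional_gate_mode(new_fused_moe, 1, gate_mode="interleave")
print(old_result)
print(new_result)
```

Probing the signature keeps a single call site working across library versions. The trade-off is that the kwarg is dropped silently, which is only safe when the older default behaves acceptably, as the commit message argues for aiter before ROCm/aiter#3123.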

XiaobingSuper requested review from a team and Copilot May 11, 2026 07:52

Copilot started reviewing on behalf of XiaobingSuper May 11, 2026 07:54

Copilot AI reviewed May 11, 2026

Comment thread aiter/fused_moe.py

[MoE] keep Swiglu MXFP4 fused quant in fp32

ad01c66

XiaobingSuper requested review from coderfeli and valarLip May 11, 2026 10:12

Merge branch 'main' into xiaobing/siglu_moe_new

6115eb2

coderfeli requested a review from Zzz9990 May 12, 2026 00:59

coderfeli approved these changes May 12, 2026

coderfeli merged commit ff9bf15 into ROCm:main May 12, 2026
42 of 45 checks passed

amd-bot mentioned this pull request May 15, 2026

[CI Monitor] Daily Report - 2026-05-15 bingxche/sglang-ci-bot#72

Open

This was referenced May 15, 2026

[Bug] GPT-OSS SwiGLU MXFP4 routes to unsupported CK2stages codegen #3227

Open

[WIP DO NOT MERGE] [AMD] fix(mxfp4): route AITER MXFP4+swiglu through FlyDSL gate_mode=INTERLEAVE sgl-project/sglang#25580

Closed

amd-bot mentioned this pull request Jun 1, 2026

[CI Monitor] Daily Report - 2026-06-01 bingxche/sglang-ci-bot#90

Open

srinivamd mentioned this pull request Jun 3, 2026

fix(moe): route SwiGLU MXFP4 unshuffled weights to CK-Tile instead of CK2stages #3518

Open

This was referenced Jun 7, 2026

fused_moe SEPARATED+Swiglu+MXFP4 dispatch produces all-zero outputs since aiter#3123 #3586

Closed

[ROCm][gpt-oss] Pass GateMode.INTERLEAVE for MXFP4 W4A16 fused MoE vllm-project/vllm#44893

Merged

jhinpan mentioned this pull request Jun 19, 2026

[Issue]: MXFP4 MoE low MFU at large shapes and long latency at small tokens ROCm/FlyDSL#708

Open

coderfeli merged 3 commits into ROCm:main from XiaobingSuper:xiaobing/siglu_moe_new

XiaobingSuper commented May 11, 2026 • edited by wuhuikx

4 participants
